American Studies Seminar / Primary Source Write-Up
Author
Emily Zou
Published
October 7, 2023
Project Updates
If you remember where I was at last week, you’ll see that my project has once again dramatically changed.
I was broadly interested in how unique features of online communities affect their members’ reactions and/or resistance to broader changes, specifically social movements. I was particularly intrigued by whether approaching online behavior as a network of communities, rather than an amalgamation of individuals, could better explain why some movements succeed and others fail.
While fleshing out this idea by cleaning up a dataset of YouTube comments from Hasan Piker’s video on Hogwarts Legacy, however, I noticed something else. Many of the comments used a distinct sort of language to make their point. For instance, one commenter wrote:
“He was born under the sign of the cuck, but his Redditor side is in retrograde.”
This may read as nonsense to some, but it is a commonplace sort of statement in online communities, particularly gaming and streaming spaces, which regularly invoke ‘cucks’ and ‘Redditors’ to describe something derogatorily. While it is certainly silly, I also see it as an example of how gaming and streaming slang has expanded to describe and make meaning outside of its original topics. I became very curious about the nature of this sort of communication.
Brief Context
Statistically, we know that 82% of American adults get some of their news digitally. Anecdotally, I know that many people get their news through streamer and gamer personalities, similar to how we are more likely to keep up with Evanston and Illinois news than with California’s or China’s. I wonder whether the structure of our mental models of news (the events, controversies, celebrities, and policies that we are aware of) shapes how we engage with and interpret news, the sustained flow of information. In this case, I am interested in how the communication tools that users acquire through participating in online communities and digital subcultures are being employed to understand and discuss American news, or, more accurately to this project, news that is not strictly about gaming and streaming. This sort of work, which emphasizes the networked level of online spaces, would draw on social computing work on online ecologies to reevaluate existing research on media literacy, misinformation, and digital polarization.
Data Background
In any case, what I did here was test my intuitions and see what I’m actually working with. Putting Hasan Piker’s channel aside for a moment, I took up two streamers, primarily gamers, who I know also cover news on at least a weekly basis:
MoistCritikal (Charlie White) has 13.8 million subscribers on YouTube and, before switching to YouTube streaming in September, was consistently one of Twitch’s top streamers. Beyond video games and gaming drama, his content ranges from U.S. politics to anime reviews to ranking Greek gods.
Mogul Mail is the “news channel” of online streamer Ludwig Ahgren, who now streams on YouTube to 5.41 million subscribers but had previously gained attention through Twitch. He notably broke the previous record for most paid subscribers after streaming for 31 days straight in 2021.
I collected Ludwig and Charlie’s videos from the past year to find topics they both covered:
1. Elon Musk buys Twitter
2. Andrew Tate
3. US releases UFO report
4. Unity (gaming company) releases new policies
Data Collection
The reason I’m including the process in an assignment ostensibly just about the source is that my later research and analysis will rest on the means by which I got my data. I’m not studying the textual content of these YouTube videos and comments (for now), but rather the broader patterns that their thousands of comments form.
I’ll start with the Twitter one here. I used YouTube Data Tools to collect all the comments left under three videos:
The first two are obviously very similar, but I would like to refine and/or expand how I choose my ‘non-gamer’ news sources in the future. I chose an American mainstream news channel that covered the Elon Musk/Twitter event at the beginning of November last year. I found that mainstream news channels post a lot, every day, and their videos tend to be more specific. This makes sense, as they have a much larger team pushing out content every day, compared to a single streamer. CNN’s angle, specifically on Musk’s response to employees, doesn’t serve me well here if I were to make comparisons without considering this context. However, knowing this is interesting in itself: people can get their news from broad informal summaries rather than the typical news format.
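As a side note on mechanics, the write-up doesn’t show the loading step, so here is a minimal sketch of how per-video comment exports could be read in with pandas. The file path, the tab separator, and the ‘text’ column name are my assumptions about the export format, not confirmed details of YouTube Data Tools.

```python
import pandas as pd

def load_comments(path, source):
    """Read one exported comment file and tag each row with its source channel.
    Assumes a tab-separated file with the comment body in a 'text' column."""
    df = pd.read_csv(path, sep="\t")
    df = df[["text"]].dropna()   # keep only the comment text, drop empty rows
    df["source"] = source        # remember which channel the comment came from
    return df
```

A call like `load_comments("ludwig_musk.tsv", "Mogul Mail")` (hypothetical filename) would then yield one tagged dataframe per video.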
```python
import pytextrank, spacy
import scattertext as st
import numpy as np
import pandas as pd
from scattertext import SampleCorpora, produce_scattertext_explorer
from scattertext import produce_scattertext_html
from scattertext.CorpusFromPandas import CorpusFromPandas
import IPython
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
```
After reading in our three datasets (Mogul Mail, MoistCritikal, CNN), I made three new dataframes to work with: since we’re comparing language across two sources at a time, there is one dataframe for each pairing. Then, we have to do some preprocessing, filtering out words that aren’t going to be helpful for us.
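The pairing step described above can be sketched roughly as follows; the function and dataframe names are placeholders of mine, not the notebook’s actual variables.

```python
import pandas as pd

def pair_corpus(df_a, name_a, df_b, name_b):
    """Stack two channels' comments into one dataframe with a 'category'
    column, so a two-sided comparison can tell the sources apart."""
    a = df_a.copy()
    a["category"] = name_a
    b = df_b.copy()
    b["category"] = name_b
    return pd.concat([a, b], ignore_index=True)
```

Running this once per pairing (Mogul Mail/CNN, MoistCritikal/CNN, Mogul Mail/MoistCritikal) gives the three comparison dataframes.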
In the future, I’d like to be more careful with this part and spend more time on it. What I did here is probably not optimal, and it also won’t be replicable if I end up doing this with more videos rather than manually. In any case, I first ran all the code and looked through the final graphs, making a list of words that I didn’t think were very informative. I put them in a list called “stopward” to filter out. You can see how this is cheating a bit.
Then, I made some more general functions for filtering. There is a nice list of the most common words in the English language, such as ‘as’ or ‘of’ or ‘and’, that I used, as well as getting rid of numbers and words with fewer than three characters.
```python
# Tokenize each comment, then filter the tokens down to informative words.
def tokenize(tea):
    return nltk.tokenize.word_tokenize(tea)

nami['tokens'] = nami['text'].apply(tokenize)
sanji['tokens'] = sanji['text'].apply(tokenize)
zoro['tokens'] = zoro['text'].apply(tokenize)

def garp(tokens):
    lista = [i for i in tokens if i.isalpha()]       # alphabetic tokens only (drops numbers)
    liste = [i for i in lista if len(i) > 2]         # drop very short words
    listf = [i for i in liste if i not in stopward]  # drop my hand-picked uninformative words
    return [i for i in listf if i not in stopwords.words('english')]  # drop common English words

nami['tony'] = nami['tokens'].apply(garp)
sanji['tony'] = sanji['tokens'].apply(garp)
zoro['tony'] = zoro['tokens'].apply(garp)
```
```python
# Join the filtered tokens back into a single string per comment.
def backstring(tokens):
    return ' '.join(str(x) for x in tokens)

nami['tony'] = nami['tony'].apply(backstring)
sanji['tony'] = sanji['tony'].apply(backstring)
zoro['tony'] = zoro['tony'].apply(backstring)
```
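To give some intuition for what the scatterplot comparisons below are surfacing, here is a rough standard-library sketch of the same idea: ranking words by how much more frequent they are on one side than the other. The smoothed frequency ratio here is my own simplification, not scattertext’s actual scoring function.

```python
from collections import Counter

def distinctive_terms(tokens_a, tokens_b, top=10):
    """Rank words that appear relatively more often in tokens_a than in
    tokens_b, using an add-one-smoothed frequency ratio (a crude stand-in
    for the scatterplot's axes)."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    na, nb = max(sum(ca.values()), 1), max(sum(cb.values()), 1)
    vocab = set(ca) | set(cb)
    ratio = {w: ((ca[w] + 1) / na) / ((cb[w] + 1) / nb) for w in vocab}
    return sorted(vocab, key=ratio.get, reverse=True)[:top]
```

Words unique to one comment set (like “streamer” or “chat” on a streamer’s side) end up at the extremes of this ranking, which is roughly what the plots visualize in two dimensions.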
The caveat with the CNN video topic is apparent here, which is something we will account for later. Data points such as “employees” or “work” don’t really tell us much. This can be fixed with more corpora of data. Still, we can catch glimpses of different languages: on “Ludwig’s side”, we see more acronyms/slang such as “imo” and “ahh.” The overall topic differences are also expected, such as the unique usage of words like “streamer”, “chat”, and “cringe.”
At first glance, the difference in plot shapes is obvious: compared to the CNN/Ludwig plot, this one is much flatter, meaning that the language across the two documents was more similar. Of course, the videos were released on the same day with the same name, but the shared vocabulary is suggestive: “zuck”, “boomer”, “core”. These streamer-to-streamer comparisons would be useful for identifying language and terms that are commonly used across different channels. Also important to note: in the future, I would want to equalize the document and word counts.
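Equalizing document counts could be as simple as downsampling the larger side before building a comparison; a sketch, with placeholder names:

```python
def equalize_counts(df_a, df_b, seed=0):
    """Randomly downsample the larger comment set so both sides contribute
    the same number of documents to a comparison."""
    n = min(len(df_a), len(df_b))
    return (df_a.sample(n=n, random_state=seed).reset_index(drop=True),
            df_b.sample(n=n, random_state=seed).reset_index(drop=True))
```

Equalizing word counts (rather than comment counts) would need a further step, since comment lengths differ between channels.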
This plot is the most polarized of the three: the comments on MoistCritikal’s side are much more explicit. Terms like “dickriding”, “clout”, and “amogi” are all unique, compared to CNN’s comments, which were more likely to mention proper nouns.
These plots were fun to make, but not terribly informative by themselves: we can get a broad idea of where things stand, but how language is actually used will require much more investigation. Other than the problems I already raised when selecting these sources (difference in topic, difference in comment volume), I also want to spend more time thinking about the implications of which channels I select from.
In future work, other than refining the process overall, I also plan to actually sample the comments for what they say and how they are used, which is much harder to do computationally (at least as far as I am aware). When it comes to informal gaming/streaming slang, I also need to be careful with how I represent what terms mean. Terms like “omegalul” and “kekw” have never been formally defined, so it is difficult to safely argue that a user meant one thing or another when using those words.
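One low-tech way to start that sampling is to pull a random handful of comments containing a given term for close reading; a sketch, assuming a dataframe with a ‘text’ column like the ones above:

```python
import pandas as pd

def sample_for_reading(df, term, n=20, seed=0):
    """Pull a random sample of comments containing a term, for close reading
    of how the term is actually used in context."""
    hits = df[df["text"].str.contains(term, case=False, na=False)]
    return hits.sample(n=min(n, len(hits)), random_state=seed)
```

Reading, say, twenty random “kekw” comments per channel would give a grounded sense of usage before attempting any formal definition.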